
Housing Market - Paper2

1 Introduction

This paper continues our discussion of Seattle housing prices from our first project. As mentioned in our earlier report, “Seattle boasts one of the ‘hottest’ housing markets in the country; as of July 2018, Seattle had ‘led the nation in home price gains’ for 21 straight months.” Given this context, our basic SMART question remains the same: how can we predict house prices in Seattle? This amounts to a regression problem on the target variable, price, and we will apply four different approaches learned throughout the course to try to solve it.

Included in our discussion is some exploratory data analysis (EDA) from our first assignment, along with several new models: KNN, ridge and lasso regression, PCA/PCR, and decision trees/random forests. Each model comes with its own advantages and disadvantages, and no single model offers total explanatory power. Our hope, however, is that the whole is greater than the sum of its parts.

2 EDA

The following are excerpts and graphs from the EDA section of our previous report. We are including them here to remind the reader of our dataset’s attributes, before we dive into the analysis.

## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7129300520 6414100192 5631500400 2487200875 1954400510 ...
##  $ date         : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...

“A brief overview of the dataset yields the following observations for housing price: the minimum price is $78,000, while the maximum is $7,700,000 (quite a large range); the mean is $540,198, indicating that the dataset is right-skewed, as further shown by the histogram below; the standard deviation is $367,142; and the variance is 134,792,956,735 (quite large, indicating that the data points are very spread out from the mean and from one another).”
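These spread statistics follow from the standard formulas; below is a minimal Python sketch (toy prices for illustration, not the actual dataset) showing how mean, variance, standard deviation, and skew relate:

```python
# Toy price sample for illustration only -- not the Kaggle dataset.
prices = [78_000, 250_000, 450_000, 540_000, 600_000, 1_200_000, 7_700_000]

n = len(prices)
mean = sum(prices) / n
# Sample variance uses the n - 1 denominator; sd is its square root.
variance = sum((p - mean) ** 2 for p in prices) / (n - 1)
sd = variance ** 0.5

# In a right-skewed sample the mean sits well above the median.
median = sorted(prices)[n // 2]
print(mean > median)  # True
```

The dataset’s enormous variance relative to its mean is exactly this effect at scale: a handful of multi-million-dollar sales drag the mean and variance upward.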

Below is a visualization of the points in the dataset by price on a map, plotted with the leaflet library. Note that the data have been binned into unequal intervals to better visualize the distribution of housing price, so please read the legend carefully. More expensive houses tend to be concentrated near the water and the center of the city.

2.1 Important Features Comparisons

From the scatterplot, it’s apparent that there is a relatively strong, positive correlation between housing price and living space (.70192, to be exact). That is, as living space increases, so does housing price. Note that a majority of the data points lie below 6,000 sqft, and below $2 million.
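The 0.70192 figure is the Pearson correlation between sqft_living and price; a self-contained sketch of the computation on a handful of rows (a toy subset, so the value differs from the full-dataset figure):

```python
# A few illustrative (sqft_living, price) pairs; the report's 0.70192
# comes from all 21,613 rows, not this toy subset.
x = [1180, 2570, 770, 1960, 1680, 5420]
y = [221_900, 538_000, 180_000, 604_000, 510_000, 1_225_000]

def pearson(a, b):
    """Pearson correlation: covariance over the product of spreads."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

r = pearson(x, y)
print(round(r, 3))  # strongly positive, consistent with the scatterplot
```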

Here we have a boxplot comparing “grade” with housing price. “Grade” represents an index from 1 to 13, with the lowest number representing poor construction and design. The trend is clear: construction and design grade correlate positively with housing price.

3 KNN

Here, we pass “price” through a log function to better normalize its distribution (untransformed, it has a heavy right skew). After the log transform, we convert “price” to a factor variable and divide it into 3 categories (“Low”, “Medium”, and “High”) to prepare for KNN analysis.

## [1] 11.3 15.9
##    Low Medium   High 
##   7215  13996    385
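The transform-then-bin step works like the sketch below, which cuts log(price) into three equal-width intervals over its range (matching the printed range of roughly 11.3 to 15.9); the cut points here come from toy prices, so the class balance differs from the table above:

```python
import math

# Illustrative prices; the real data run from $78,000 to $7,700,000.
prices = [78_000, 221_900, 538_000, 604_000, 1_225_000, 7_700_000]
log_p = [math.log(p) for p in prices]

# Three equal-width bins over the log range, as the report does.
lo, hi = min(log_p), max(log_p)
width = (hi - lo) / 3

def bucket(v):
    if v < lo + width:
        return "Low"
    if v < lo + 2 * width:
        return "Medium"
    return "High"

labels = [bucket(v) for v in log_p]
print(labels)  # ['Low', 'Low', 'Medium', 'Medium', 'Medium', 'High']
```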

Next, we split the data into 80% training, and 20% test subsets.

## [1] 0.8
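The split itself is just a seeded shuffle and slice; a minimal sketch of the idea (not the exact sampling call used in the report):

```python
import random

def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle-and-slice split; deterministic for a fixed seed."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```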

We then apply the chooseK() function to determine the best KNN k value for our dataset. Within the chooseK() function itself, we select features that are truly numeric (KNN requires numeric predictors). For instance, even though “yr_built” is stored as an “integer” data type, a concept such as year is best thought of as categorical, not numerical, so we excluded it and other similar variables from the analysis.

From the resulting graph, it becomes evident that 12 is approximately the best value for k: it offers the highest accuracy.
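chooseK() boils down to looping over candidate k values and keeping the one with the best held-out accuracy; below is a toy, self-contained version of that loop with a basic one-feature KNN (illustrative only, not the course’s chooseK()):

```python
from collections import Counter

def knn_predict(train, k, x):
    """train: list of (feature, label); classify x by majority vote of k nearest."""
    nearest = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 1-D data: small values are "Low", large are "High".
train = [(i, "Low") for i in range(10)] + [(i, "High") for i in range(20, 30)]
test = [(3, "Low"), (7, "Low"), (22, "High"), (28, "High")]

def accuracy(k):
    hits = sum(knn_predict(train, k, x) == y for x, y in test)
    return hits / len(test)

best_k = max(range(1, 10, 2), key=accuracy)  # odd k avoids ties
print(best_k, accuracy(best_k))  # the separable toy data is easy at any odd k
```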

Now that we have our k value, we can run our KNN analysis, using the same features we fed the chooseK() function previously.

##  Factor w/ 3 levels "Low","Medium",..: 1 2 2 2 2 2 2 2 2 2 ...
## price_predict
##    Low Medium   High 
##   1251   3055     13

Now let’s take a look at the results. Our KNN model classified housing price correctly about 74% of the time, or 3183 out of a possible 4319 test cases.

Reading the confusion matrix below, it misclassified 421 medium-priced houses as “low”, 636 low-priced houses as “medium”, 6 medium-priced houses as “high”, and 73 high-priced houses as “medium”.

KNN is a useful algorithm for classifying data points. We showed that at 74% accuracy, our algorithm successfully predicted housing price categories based on variables such as “bedrooms”, “bathrooms”, “sqft_living”, “sqft_lot”, “floors”, “sqft_above”, “sqft_basement”, “sqft_living15”, and “sqft_lot15”. If we were in the real estate market and wanted to know generally how high or low we should price a house, we could determine an answer based on these variables.

It should be noted, however, that KNN as applied here cannot predict numeric prices, since the response variable (price) must be categorical. To predict specific prices, one must use a regression method such as linear regression or PCR.

##              
## price_predict  Low Medium High
##        Low     830    421    0
##        Medium  636   2346   73
##        High      0      6    7
## [1]  830 2346    7

Overall, then, KNN achieves an accuracy of 0.737.
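That 0.737 is simply the diagonal of the confusion matrix divided by its grand total; recomputing from the table above:

```python
# Rows are predicted (Low, Medium, High), columns are actual,
# copied from the confusion matrix in the output above.
cm = [
    [830,  421,  0],
    [636, 2346, 73],
    [0,      6,  7],
]

correct = sum(cm[i][i] for i in range(3))
total = sum(sum(row) for row in cm)
print(correct, total, round(correct / total, 3))  # 3183 4319 0.737
```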

4 Tree Modeling

We also applied a decision tree modeling approach to this regression problem, both to predict the average value of price and to gain interesting visualizations of the dataset.

4.1 Regression Tree

Using the ‘tree’ package, we built a tree on all the features used as predictors (i.e., excluding id, date, the various geographic variables, sqft_living15, and sqft_lot15) for price. We used the logarithm of price to make the tree easier to visualize and to allow more precise average values per node. This tree has 9 terminal nodes and a mean squared error of 0.12.

From the plot of the tree, we can see that the algorithm splits the data using grade, yr_built, and sqft_living. As a general trend, higher grade and larger sqft_living mean higher price, while lower grade and smaller houses mean lower price, as expected. The splits also reveal an interesting fact: the age of a house correlates positively with its price; in splits based on yr_built, the older the house, the higher the average price. Since the tree predicts the log of price, exponentiating the leaf values shows that the highest average price, in the farthest-right leaf, is 1,289,802.934 dollars, and the lowest, in the farthest-left leaf, is 273,758.059 dollars.
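Under the hood, a regression tree grows by repeatedly choosing the split that most reduces within-node squared error; a minimal sketch of one split search on toy data (not the ‘tree’ package’s exact algorithm):

```python
def sse(ys):
    """Sum of squared errors around the mean of ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Return the threshold on x that minimizes left + right squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Toy sqft_living vs log(price): a clear jump above 1500 sqft.
xs = [800, 1000, 1200, 1500, 2500, 3000, 3500, 4000]
ys = [12.1, 12.2, 12.1, 12.3, 13.4, 13.5, 13.6, 13.8]
t, score = best_split(xs, ys)
print(t)  # 1500: the split lands between the two price clusters
```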

Furthermore, we used the ‘rpart’ package to build a fancier visualization of the decision tree for this dataset.

These trees give a fairly similar result to the classic tree, but here we can also observe the proportion of sample observations in each terminal leaf (in this case only 7). Again, the variables used to build the tree are grade, sqft_living, and yr_built. As expected, the right branches of the tree contain only 20% of the houses: the most expensive ones, with better grades than the rest. Indeed, the houses with better grade and sqft_living above 3,757 make up only 5% of the data. The largest proportions sit in the left branches, given their lower grades: 80% of houses have a grade between 3 and 8, and 52% a grade of 3 to 7. The terminal node with the highest proportion contains houses of grade 7 built after 1953, namely 29% of the observations in the dataset.

On another note, we experimented with a tree that gives an average price per geographic area. Using longitude and latitude as predictors for the log of price, we built the following tree, which shows the mean price for each area on a map, divided by splits on longitude and latitude.

This visualization shows that higher prices correspond to the central area, i.e., Seattle downtown and adjacent zones, where the average price is estimated at 1,088,161.355 dollars, while areas on the outskirts of the region show average values of 327,747.902 or 296,558.565 dollars, as one would normally expect.

4.2 Pruned Tree

Subsequently, we pruned the tree down with the standard ‘prune’ function.

First, we can plot a sequence of pruned trees’ sizes against their error rates. The vector of error rates for the pruning in our case is 2593.61, 2668.388, 2744.88, 2875.801, 3268.008, 3489.958, 4016.003, 5984.489.

We then asked for the smallest optimal tree among these pruned trees, which turned out to have size 9. The plot shows a pruned tree with 6 terminal nodes, with splits again on grade, sqft_living, and yr_built. This tree is simpler to read given the fewer splits, but it adds no predictive power while taking away some important splits from a model that already simplifies the dataset significantly. Thus, we decided it was better to use the non-pruned tree for testing our model.
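The pruning trade-off can be sketched as a simple selection rule: take the smallest tree whose deviance is within some tolerance of the best. The deviance vector from the output above is reused here; pairing it with sizes 9 down to 2, and the 10% tolerance, are assumptions for illustration:

```python
# Deviance for pruned trees of decreasing size, copied from the output above;
# the pairing with sizes 9..2 is an assumption for illustration.
sizes = [9, 8, 7, 6, 5, 4, 3, 2]
dev = [2593.61, 2668.388, 2744.88, 2875.801, 3268.008, 3489.958, 4016.003, 5984.489]

best = min(dev)
tolerance = 0.10  # accept up to 10% worse deviance for a smaller tree
candidates = [s for s, d in zip(sizes, dev) if d <= best * (1 + tolerance)]
print(min(candidates))  # 7: the smallest acceptable tree under this tolerance
```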

4.3 Testing model

First, to test the model's performance, we divided the dataset into training and test sets, where the latter is simply the first fold of the data.

## 
## Regression tree:
## rpart(formula = log(price) ~ bedrooms + bathrooms + sqft_living + 
##     sqft_lot + floors + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated, data = train.set)
## 
## Variables actually used in tree construction:
## [1] grade       sqft_living yr_built   
## 
## Root node error: 5390/19418 = 0.3
## 
## n= 19418 
## 
##     CP nsplit rel error xerror  xstd
## 1 0.33      0       1.0    1.0 0.012
## 2 0.09      1       0.7    0.7 0.007
## 3 0.04      2       0.6    0.6 0.007
## 4 0.03      3       0.5    0.5 0.006
## 5 0.02      5       0.5    0.5 0.005
## 6 0.01      6       0.5    0.5 0.005
## 7 0.01      7       0.4    0.5 0.005
## 8 0.01      8       0.4    0.4 0.005

## [1] 0.01

## Call:
## rpart(formula = log(price) ~ bedrooms + bathrooms + sqft_living + 
##     sqft_lot + floors + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated, data = train.set)
##   n= 19418 
## 
##       CP nsplit rel error xerror    xstd
## 1 0.3284      0     1.000  1.000 0.01173
## 2 0.0887      1     0.672  0.672 0.00737
## 3 0.0377      2     0.583  0.583 0.00655
## 4 0.0324      3     0.545  0.548 0.00586
## 5 0.0217      5     0.480  0.497 0.00545
## 6 0.0200      6     0.459  0.472 0.00527
## 
## Variable importance
##         grade   sqft_living    sqft_above     bathrooms      yr_built 
##            43            18            18             9             7 
##        floors sqft_basement      bedrooms 
##             2             2             1 
## 
## Node number 1: 19418 observations,    complexity param=0.328
##   mean=13, MSE=0.278 
##   left son=2 (15556 obs) right son=3 (3862 obs)
##   Primary splits:
##       grade       splits as  LLLLLLRRRRR, improve=0.328, (0 missing)
##       sqft_living < 2440   to the left,   improve=0.317, (0 missing)
##       sqft_above  < 2000   to the left,   improve=0.233, (0 missing)
##       bathrooms   < 2.62   to the left,   improve=0.181, (0 missing)
##       bedrooms    < 3.5    to the left,   improve=0.112, (0 missing)
##   Surrogate splits:
##       sqft_above    < 2500   to the left,  agree=0.883, adj=0.413, (0 split)
##       sqft_living   < 2920   to the left,  agree=0.880, adj=0.395, (0 split)
##       bathrooms     < 3.12   to the left,  agree=0.839, adj=0.191, (0 split)
##       sqft_basement < 1520   to the left,  agree=0.808, adj=0.035, (0 split)
##       yr_built      < 2010   to the left,  agree=0.802, adj=0.004, (0 split)
## 
## Node number 2: 15556 observations,    complexity param=0.0887
##   mean=12.9, MSE=0.18 
##   left son=4 (10110 obs) right son=5 (5446 obs)
##   Primary splits:
##       grade         splits as  LLLLLR-----, improve=0.1710, (0 missing)
##       sqft_living   < 2000   to the left,   improve=0.1660, (0 missing)
##       bathrooms     < 1.62   to the left,   improve=0.0974, (0 missing)
##       sqft_above    < 1410   to the left,   improve=0.0972, (0 missing)
##       sqft_basement < 75     to the left,   improve=0.0688, (0 missing)
##   Surrogate splits:
##       bathrooms   < 2.12   to the left,  agree=0.735, adj=0.244, (0 split)
##       sqft_above  < 1770   to the left,  agree=0.735, adj=0.244, (0 split)
##       sqft_living < 2140   to the left,  agree=0.735, adj=0.244, (0 split)
##       floors      < 1.75   to the left,  agree=0.725, adj=0.214, (0 split)
##       yr_built    < 1990   to the left,  agree=0.707, adj=0.162, (0 split)
## 
## Node number 3: 3862 observations,    complexity param=0.0377
##   mean=13.7, MSE=0.212 
##   left son=6 (2950 obs) right son=7 (912 obs)
##   Primary splits:
##       sqft_living   < 3760   to the left,   improve=0.248, (0 missing)
##       grade         splits as  ------LRRRR, improve=0.227, (0 missing)
##       bathrooms     < 3.12   to the left,   improve=0.190, (0 missing)
##       sqft_above    < 3840   to the left,   improve=0.151, (0 missing)
##       sqft_basement < 558    to the left,   improve=0.120, (0 missing)
##   Surrogate splits:
##       sqft_above    < 3760   to the left,   agree=0.900, adj=0.576, (0 split)
##       grade         splits as  ------LLRRR, agree=0.836, adj=0.304, (0 split)
##       bathrooms     < 3.62   to the left,   agree=0.827, adj=0.266, (0 split)
##       sqft_basement < 1220   to the left,   agree=0.803, adj=0.166, (0 split)
##       bedrooms      < 5.5    to the left,   agree=0.774, adj=0.042, (0 split)
## 
## Node number 4: 10110 observations,    complexity param=0.0324
##   mean=12.8, MSE=0.158 
##   left son=8 (2066 obs) right son=9 (8044 obs)
##   Primary splits:
##       grade         splits as  LLLLR------, improve=0.1050, (0 missing)
##       sqft_living   < 1500   to the left,   improve=0.1020, (0 missing)
##       sqft_basement < 30     to the left,   improve=0.0818, (0 missing)
##       yr_built      < 1930   to the right,  improve=0.0616, (0 missing)
##       bathrooms     < 1.62   to the left,   improve=0.0572, (0 missing)
##   Surrogate splits:
##       sqft_living < 935    to the left,  agree=0.836, adj=0.196, (0 split)
##       sqft_above  < 815    to the left,  agree=0.826, adj=0.150, (0 split)
##       bedrooms    < 1.5    to the left,  agree=0.802, adj=0.033, (0 split)
##       bathrooms   < 0.875  to the left,  agree=0.799, adj=0.017, (0 split)
##       condition   splits as  LLRRR,      agree=0.798, adj=0.009, (0 split)
## 
## Node number 5: 5446 observations,    complexity param=0.0217
##   mean=13.1, MSE=0.134 
##   left son=10 (4195 obs) right son=11 (1251 obs)
##   Primary splits:
##       yr_built      < 1960   to the right, improve=0.1610, (0 missing)
##       sqft_living   < 2440   to the left,  improve=0.0900, (0 missing)
##       sqft_basement < 465    to the left,  improve=0.0689, (0 missing)
##       condition     splits as  RLLLR,      improve=0.0491, (0 missing)
##       yr_renovated  < 978    to the left,  improve=0.0409, (0 missing)
##   Surrogate splits:
##       yr_renovated < 978    to the left,  agree=0.798, adj=0.120, (0 split)
##       bathrooms    < 1.62   to the right, agree=0.790, adj=0.088, (0 split)
##       condition    splits as  RLLLR,      agree=0.787, adj=0.074, (0 split)
##       sqft_living  < 4380   to the left,  agree=0.771, adj=0.003, (0 split)
##       sqft_lot     < 440000 to the left,  agree=0.771, adj=0.002, (0 split)
## 
## Node number 6: 2950 observations
##   mean=13.5, MSE=0.142 
## 
## Node number 7: 912 observations
##   mean=14.1, MSE=0.218 
## 
## Node number 8: 2066 observations
##   mean=12.5, MSE=0.156 
## 
## Node number 9: 8044 observations,    complexity param=0.0324
##   mean=12.8, MSE=0.137 
##   left son=18 (5610 obs) right son=19 (2434 obs)
##   Primary splits:
##       yr_built      < 1950   to the right, improve=0.1650, (0 missing)
##       sqft_living   < 2000   to the left,  improve=0.0766, (0 missing)
##       sqft_lot      < 6520   to the right, improve=0.0662, (0 missing)
##       sqft_basement < 30     to the left,  improve=0.0604, (0 missing)
##       condition     splits as  LLLLR,      improve=0.0240, (0 missing)
##   Surrogate splits:
##       bedrooms     < 2.5    to the right, agree=0.736, adj=0.128, (0 split)
##       sqft_above   < 955    to the right, agree=0.725, adj=0.092, (0 split)
##       yr_renovated < 970    to the left,  agree=0.718, adj=0.069, (0 split)
##       bathrooms    < 1.12   to the right, agree=0.717, adj=0.063, (0 split)
##       sqft_living  < 955    to the right, agree=0.712, adj=0.048, (0 split)
## 
## Node number 10: 4195 observations
##   mean=13.1, MSE=0.105 
## 
## Node number 11: 1251 observations
##   mean=13.4, MSE=0.136 
## 
## Node number 18: 5610 observations
##   mean=12.7, MSE=0.11 
## 
## Node number 19: 2434 observations
##   mean=13.1, MSE=0.125

## NULL
## [1] Inf

Building the tree on the training dataset, we obtain a slightly different tree pruned to 5 leaves, which splits the data on grade first and then sqft_living. The plot of error versus size also points to an optimal tree at 34 nodes.

## n= 19418 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 19418 5390 13.0  
##    2) grade=3,4,5,6,7,8 15556 2800 12.9  
##      4) grade=3,4,5,6,7 10110 1590 12.8  
##        8) grade=3,4,5,6 2066  323 12.5 *
##        9) grade=7 8044 1100 12.8  
##         18) yr_built>=1.95e+03 5610  618 12.7 *
##         19) yr_built< 1.95e+03 2434  305 13.1 *
##      5) grade=8 5446  727 13.1  
##       10) yr_built>=1.96e+03 4195  440 13.1 *
##       11) yr_built< 1.96e+03 1251  171 13.4 *
##    3) grade=9,10,11,12,13 3862  820 13.7  
##      6) sqft_living< 3.76e+03 2950  418 13.5 *
##      7) sqft_living>=3.76e+03 912  199 14.1 *

Evaluating this tree on the test data, we can see that the trained model did a good job predicting price, as the errors and the tree structure are almost identical to those on the training data.

## [1] 413761889192

4.4 Random Forest

Finally, we used a Random Forest algorithm to evaluate whether this type of ensembling could improve the performance of the tree model on our dataset.

## 
## Call:
##  randomForest(formula = log(price) ~ bedrooms + bathrooms + sqft_living +      sqft_lot + floors + condition + grade + sqft_above + sqft_basement +      yr_built + yr_renovated, data = train.set, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.0871
##                     % Var explained: 68.6

Thus, we ran a regression random forest on all the predictor variables with the logarithm of price as the target. The model builds 500 trees, trying 3 variables at each split, and achieves 68.6% of variance explained with an MSE of 0.0871.
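What drives the random forest’s lower MSE is averaging many trees grown on bootstrap resamples; a toy stand-in using bagged one-split “stumps” (a sketch of the bagging idea only, not randomForest’s implementation, which also samples predictors at each split):

```python
import random

def fit_stump(xs, ys):
    """One-split regression tree: a threshold plus left/right means."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    if best is None:  # degenerate resample: fall back to a constant
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def bagged_predict(xs, ys, x_new, n_trees=50, seed=0):
    """Average the predictions of stumps fit on bootstrap resamples."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # bootstrap sample
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(stump(x_new))
    return sum(preds) / len(preds)

# Toy sqft_living vs log(price) with two clear clusters.
xs = [800, 1200, 1500, 2500, 3000, 4000]
ys = [12.1, 12.2, 12.3, 13.4, 13.5, 13.8]
pred = bagged_predict(xs, ys, 3200)
print(round(pred, 2))  # near the upper cluster's average
```

Averaging over resamples smooths out the variance of any single tree, which is why the ensemble's MSE drops relative to the lone regression tree above.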

5 PCA

Here, we use principal component analysis (PCA) to help predict house price. We chose this method because the dataset has many variables and it is not obvious which ones to use; we want to capture as much information as possible with the fewest number of variables.

5.1 Subset Data

We deleted some variables we presumed to be uncorrelated with house price, such as latitude, longitude, the neighborhood measures sqft_living15 and sqft_lot15, yr_renovated, zipcode, the date of record, and id. There are also several variables containing mostly 0 values, so we deleted them as well (these include “view” and “waterfront”). We then took a look at the dataset.

## 'data.frame':    21613 obs. of  11 variables:
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...

There are 11 variables left, one of which, price, will serve as the dependent variable. All variables are numeric.

Next, we need to check if there are NA values in the dataset.

There are no NA values.

##               bedrooms bathrooms sqft_living sqft_lot  floors condition
## bedrooms        1.0000    0.5159      0.5767  0.03170  0.1754   0.02847
## bathrooms       0.5159    1.0000      0.7547  0.08774  0.5007  -0.12498
## sqft_living     0.5767    0.7547      1.0000  0.17283  0.3539  -0.05875
## sqft_lot        0.0317    0.0877      0.1728  1.00000 -0.0052  -0.00896
## floors          0.1754    0.5007      0.3539 -0.00520  1.0000  -0.26377
## condition       0.0285   -0.1250     -0.0588 -0.00896 -0.2638   1.00000
## grade           0.3570    0.6650      0.7627  0.11362  0.4582  -0.14467
## sqft_above      0.4776    0.6853      0.8766  0.18351  0.5239  -0.15821
## sqft_basement   0.3031    0.2838      0.4350  0.01529 -0.2457   0.17410
## yr_built        0.1542    0.5060      0.3180  0.05308  0.4893  -0.36142
##                grade sqft_above sqft_basement yr_built
## bedrooms       0.357     0.4776        0.3031   0.1542
## bathrooms      0.665     0.6853        0.2838   0.5060
## sqft_living    0.763     0.8766        0.4350   0.3180
## sqft_lot       0.114     0.1835        0.0153   0.0531
## floors         0.458     0.5239       -0.2457   0.4893
## condition     -0.145    -0.1582        0.1741  -0.3614
## grade          1.000     0.7559        0.1684   0.4470
## sqft_above     0.756     1.0000       -0.0519   0.4239
## sqft_basement  0.168    -0.0519        1.0000  -0.1331
## yr_built       0.447     0.4239       -0.1331   1.0000

Next, we looked at the correlations among the 10 predictor variables and found that they are all related, some highly so. This makes it difficult to determine which ones are important, so we used PCA to reduce dimensionality.

5.2 PCA part

As the 10 variables have different scales, it is necessary to scale them before analysis. Performing PCA on un-normalized variables will heavily weight variables with high variances.

We used the prcomp function to perform PCA, and then checked the means and standard deviations of the variables. Since prcomp centered and scaled the data, the reported means are essentially zero (up to floating-point error), while the standard deviations listed below are those of the original variables.

## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7
## Standard deviation     2.071 1.322 1.007 0.9159 0.7926 0.7527 0.6683
## Proportion of Variance 0.429 0.175 0.101 0.0839 0.0628 0.0567 0.0447
## Cumulative Proportion  0.429 0.604 0.705 0.7890 0.8518 0.9085 0.9531
##                           PC8    PC9               PC10
## Standard deviation     0.5060 0.4611 0.0000000000000139
## Proportion of Variance 0.0256 0.0213 0.0000000000000000
## Cumulative Proportion  0.9787 1.0000 1.0000000000000000
##               bedrooms              bathrooms            sqft_living 
##  0.0000000000000002143 -0.0000000000000001689  0.0000000000000002410 
##               sqft_lot                 floors              condition 
##  0.0000000000000000132 -0.0000000000000000227 -0.0000000000000002160 
##                  grade             sqft_above          sqft_basement 
##  0.0000000000000002022  0.0000000000000001110  0.0000000000000000207 
##               yr_built 
##  0.0000000000000019023
##      bedrooms     bathrooms   sqft_living      sqft_lot        floors 
##         0.930         0.770       918.441     41420.512         0.540 
##     condition         grade    sqft_above sqft_basement      yr_built 
##         0.651         1.175       828.091       442.575        29.373

We can see that with 7 components, 95% of the variance is explained.
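Choosing 7 components is read straight off the cumulative proportions; the selection step, using the per-component variance proportions from the prcomp summary above:

```python
from itertools import accumulate

# Proportion of variance per component, copied from the prcomp summary above.
prop_var = [0.429, 0.175, 0.101, 0.0839, 0.0628, 0.0567, 0.0447, 0.0256, 0.0213]

cumulative = list(accumulate(prop_var))
# Smallest number of components whose cumulative proportion reaches 95%.
n_components = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.95)
print(n_components, round(cumulative[n_components - 1], 3))  # 7 0.953
```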

Let’s take a look at how the variables form each component.

##                   PC1     PC2     PC3     PC4     PC5     PC6     PC7
## bedrooms       0.2873 -0.3055  0.1563 -0.0402  0.7428 -0.3876  0.0572
## bathrooms      0.4212 -0.0715  0.0984  0.0443 -0.1394 -0.2240 -0.1461
## sqft_living    0.4353 -0.2450 -0.0401  0.0120 -0.0186  0.2590  0.0456
## sqft_lot       0.0800 -0.0511 -0.9540  0.1245  0.0366 -0.1928 -0.1442
## floors         0.2961  0.3859  0.1091 -0.2784 -0.0676 -0.1257 -0.7552
## condition     -0.1103 -0.4417 -0.0495 -0.7460 -0.3469 -0.3140  0.1046
## grade          0.4091 -0.0096 -0.0230 -0.0549 -0.2796  0.3578  0.1993
## sqft_above     0.4318  0.0462 -0.1214 -0.2472  0.1525  0.3205  0.2049
## sqft_basement  0.0954 -0.5949  0.1439  0.4875 -0.3240 -0.0621 -0.2887
## yr_built       0.2853  0.3726  0.0624  0.2121 -0.3096 -0.5886  0.4540
##                   PC8     PC9                   PC10
## bedrooms       0.2925  0.0855  0.0000000000000014791
## bathrooms     -0.6178  0.5773  0.0000000000000023896
## sqft_living   -0.1601 -0.4058 -0.6992603665794965284
## sqft_lot       0.0600  0.0500 -0.0000000000000001189
## floors         0.2221 -0.1840  0.0000000000000002764
## condition      0.0313 -0.0540  0.0000000000000001844
## grade          0.6000  0.4723  0.0000000000000002236
## sqft_above    -0.2521 -0.3265  0.6304719252550327058
## sqft_basement  0.1395 -0.2312  0.3369571058700506772
## yr_built       0.1010 -0.2691  0.0000000000000000411

From the rotation matrix, the living-space variables dominate the first component: sqft_living, sqft_above, bathrooms, and grade all carry loadings of about 0.41 to 0.44. The second component is driven mostly by sqft_basement, whose loading is near 0.59 in magnitude.

We can visualize the variance explained by each component. We can see that the first component explains the most, while the subsequent ones explain less.

We can also visualize how each component is formed by the different variables, though the resulting graph is much harder to read than the rotation matrix shown above.

Visualizing the cumulative proportions of variance, we can see that after 7 components, the curve becomes smooth.

5.3 Predicting using PCA

Finally, we created a PCR model with the first 7 components. PCR is simply a linear model on the component scores, so we used the lm function to create it.

## 
## Call:
## lm(formula = house2$price ~ houseprice[, 1:7])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1353938  -117818   -11744    91786  4213791 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)            540088       1554  347.62 < 0.0000000000000002 ***
## houseprice[, 1:7]PC1   109184        750  145.53 < 0.0000000000000002 ***
## houseprice[, 1:7]PC2   -78959       1175  -67.21 < 0.0000000000000002 ***
## houseprice[, 1:7]PC3    -9625       1543   -6.24        0.00000000046 ***
## houseprice[, 1:7]PC4   -37261       1696  -21.97 < 0.0000000000000002 ***
## houseprice[, 1:7]PC5   -58357       1960  -29.77 < 0.0000000000000002 ***
## houseprice[, 1:7]PC6   171526       2064   83.10 < 0.0000000000000002 ***
## houseprice[, 1:7]PC7   -34516       2325  -14.85 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 228000 on 21605 degrees of freedom
## Multiple R-squared:  0.613,  Adjusted R-squared:  0.613 
## F-statistic: 4.89e+03 on 7 and 21605 DF,  p-value: <0.0000000000000002

We can see that 61.3% of the variance is explained by the components. We do not use accuracy to judge the model; because house prices span such a large range, classification-style accuracy would not make sense here.
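The 61.3% figure is the model's R-squared, i.e. one minus the residual sum of squares over the total sum of squares; a quick sketch with toy actual/fitted values (not the fitted PCR model):

```python
# Toy actual vs fitted values standing in for the PCR predictions.
actual = [200_000, 350_000, 500_000, 650_000, 900_000]
fitted = [250_000, 330_000, 480_000, 700_000, 840_000]

mean_y = sum(actual) / len(actual)
ss_tot = sum((y - mean_y) ** 2 for y in actual)
ss_res = sum((y - f) ** 2 for y, f in zip(actual, fitted))
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.968 for these toy numbers
```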

We want to see how good the PCR model is, so we created a full linear model to compare it to.

## 
## Call:
## lm(formula = price ~ . - sqft_basement, data = house2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1330687  -116072   -12249    88891  4428176 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept) 6899404.9190  135999.6452   50.73 < 0.0000000000000002 ***
## bedrooms     -49150.7573    2112.1474  -23.27 < 0.0000000000000002 ***
## bathrooms     49461.3702    3632.5990   13.62 < 0.0000000000000002 ***
## sqft_living     203.6879       4.7239   43.12 < 0.0000000000000002 ***
## sqft_lot         -0.2216       0.0384   -5.77     0.00000000780726 ***
## floors        28958.5249    3915.4655    7.40     0.00000000000015 ***
## condition     18383.6066    2586.6454    7.11     0.00000000000122 ***
## grade        131536.3956    2251.9728   58.41 < 0.0000000000000002 ***
## sqft_above      -21.9900       4.6116   -4.77     0.00000186852828 ***
## yr_built      -3953.4750      69.7857  -56.65 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 227000 on 21603 degrees of freedom
## Multiple R-squared:  0.618,  Adjusted R-squared:  0.618 
## F-statistic: 3.88e+03 on 9 and 21603 DF,  p-value: <0.0000000000000002

61.8% of the variance is explained by the independent variables in the full linear model, a bit better than the PCR model.

5.4 PCR vs. Full Linear Model: A Comparison

The R^2 of the full model is 0.618, slightly higher than the PCR model’s 0.613. We can also see that both models underestimate house prices above $4,000,000.

5.5 Limitations of PCA

There are a few price outliers. We did not delete them, however, because we think the high prices are important. PCA also requires numeric inputs, even though the grade and condition variables are really ordinal/categorical, and it relies on an assumption of normality, while our dataset is heavily right-skewed.

6 Ridge and Lasso

6.1 The Ridge

For our dataset, the numbers of bedrooms, bathrooms, and floors are categorical variables, so we convert them into factors. Then we prepare a log-scale grid of \(\lambda\) values, from \(10^{10}\) down to \(10^{-2}\) in 100 steps, and build the ridge model. Afterwards, we draw a plot of the coefficients to see the overall trend.

## [1]  64 100

The glmnet() function fits 100 models, one for each of our 100 \(\lambda\) values. Each model’s coefficients are stored in the object we named ridge.mod; there are 64 coefficients per model. The 100 \(\lambda\) values range from 0.01 (\(10^{-2}\)) to \(10^{10}\), essentially covering everything from the ordinary least squares model (\(\lambda\) = 0) to the null/constant model (\(\lambda\) approaching infinity).

Because ridge regression penalizes the "L2 norm", the coefficients shrink as \(\lambda\) grows. At the "midpoint" of our grid (the 50th of the 100 values), \(\lambda = 11497.57\) and the sum of squared coefficients is 0.002. At the 60th value (the sequence is decreasing), \(\lambda = 705.48\) and the sum of squared coefficients is 0.04, about 16 times larger.

The model, however, only stores coefficients for these 100 \(\lambda\) values. The generic predict( ) function (glmnet supplies its own method) can interpolate between them for various purposes, such as calculating the predicted coefficients at \(\lambda = 50\).
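A minimal sketch of that interpolation, assuming the ridge.mod object from above:

```r
# Coefficient estimates interpolated at lambda = 50,
# a value not on the stored 100-point grid
predict(ridge.mod, s = 50, type = "coefficients")
```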

##   (Intercept)     bedrooms2     bedrooms3     bedrooms4     bedrooms5 
##    -0.0006179    -0.0076927    -0.0067785     0.0068163     0.0134016 
##     bedrooms6     bedrooms7     bedrooms8     bedrooms9    bedrooms10 
##     0.0144712     0.0202850     0.0290019     0.0175400     0.0137793 
##    bedrooms11 bathrooms0.75    bathrooms1 bathrooms1.25  bathrooms1.5 
##    -0.0019261    -0.0123238    -0.0116420     0.0047366    -0.0070193 
## bathrooms1.75    bathrooms2 bathrooms2.25  bathrooms2.5 bathrooms2.75 
##    -0.0049456    -0.0045507    -0.0004188     0.0004906     0.0062803 
##    bathrooms3 bathrooms3.25  bathrooms3.5 bathrooms3.75    bathrooms4 
##     0.0087716     0.0227680     0.0205608     0.0341827     0.0377144 
## bathrooms4.25  bathrooms4.5 bathrooms4.75    bathrooms5 bathrooms5.25 
##     0.0512497     0.0410413     0.0774969     0.0588491     0.0665617 
##  bathrooms5.5 bathrooms5.75    bathrooms6 bathrooms6.25  bathrooms6.5 
##     0.1033113     0.1011313     0.1264703     0.1325946     0.0590444 
## bathrooms6.75  bathrooms7.5 bathrooms7.75    bathrooms8   sqft_living 
##     0.1137092    -0.0062946     0.3369097     0.2325487     0.0132109 
##      sqft_lot     floors1.5       floors2     floors2.5       floors3 
##     0.0016311     0.0014869     0.0085804     0.0272498     0.0025484 
##     floors3.5    condition2    condition3    condition4    condition5 
##     0.0195488    -0.0108018    -0.0000823    -0.0010613     0.0044430 
##        grade4        grade5        grade6        grade7        grade8 
##    -0.0166845    -0.0150763    -0.0133822    -0.0120699    -0.0000254

6.1.1 Train and Test sets

Let us split the data into training and test sets, so that we can estimate test errors. The split will be used here for Ridge regression, and later for Lasso regression.
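A typical 50/50 split, sketched here with an assumed seed (the report does not state the one we used):

```r
set.seed(1)  # assumed seed, for reproducibility
train  <- sample(1:nrow(x), nrow(x) / 2)
test   <- (-train)
y.test <- y[test]

# Fit ridge on the training half and evaluate at lambda = 4
ridge.mod  <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
ridge.pred <- predict(ridge.mod, s = 4, newx = x[test, ])
mean((ridge.pred - y.test)^2)  # test MSE
```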

With \(\lambda = 4\), the test set mean squared error (MSE) is 0.571. (Keep in mind that we are working with standardized scores.)

For the null model (\(\lambda\) approaching infinity), by contrast, the MSE is 0.978. So \(\lambda = 4\) cuts the test MSE roughly in half, trading a little bias for a large reduction in variance.
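The null-model MSE can be checked directly, since the null model simply predicts the mean of the training responses for every test observation; a one-line sketch using the split above:

```r
# Null model: predict the training-set mean everywhere
mean((mean(y[train]) - y.test)^2)
```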

We could have also used a large \(\lambda\) value to find the MSE for the null model. These two methods yield essentially the same answer of 0.978.

## [1] 0.34
##   (Intercept)     bedrooms2     bedrooms3     bedrooms4     bedrooms5 
##       0.04818       0.09171      -0.04770      -0.14763      -0.12726 
##     bedrooms6     bedrooms7     bedrooms8     bedrooms9    bedrooms10 
##      -0.29521      -0.33546      -0.29151      -0.44392      -0.38316 
##    bedrooms11 bathrooms0.75    bathrooms1 bathrooms1.25  bathrooms1.5 
##      -0.97528      -0.02447      -0.04747      -0.41652      -0.05375 
## bathrooms1.75    bathrooms2 bathrooms2.25  bathrooms2.5 bathrooms2.75 
##      -0.04096      -0.04077       0.01444      -0.03759      -0.00733 
##    bathrooms3 bathrooms3.25  bathrooms3.5 bathrooms3.75    bathrooms4 
##       0.07802       0.22207       0.13623       0.50483       0.18360 
## bathrooms4.25  bathrooms4.5 bathrooms4.75    bathrooms5 bathrooms5.25 
##       0.60370       0.62950       2.39920       1.00729       1.15719 
##  bathrooms5.5 bathrooms5.75    bathrooms6 bathrooms6.25  bathrooms6.5 
##       2.22582       2.05909       1.96062       3.91087       0.29896 
## bathrooms6.75  bathrooms7.5 bathrooms7.75    bathrooms8   sqft_living 
##       1.01433       0.00000       0.00000       0.00000       0.19787 
##      sqft_lot     floors1.5       floors2     floors2.5       floors3 
##      -0.03167       0.04810       0.06720       0.17923       0.40364 
##     floors3.5    condition2    condition3    condition4    condition5 
##       0.56829      -0.03787       0.00224       0.05391       0.20033 
##        grade4        grade5        grade6        grade7        grade8 
##      -0.68443      -0.70236      -0.55870      -0.29205      -0.01155

Now consider the other extreme, small \(\lambda\), which approaches the ordinary least squares (OLS) model. Using the ridge regression object to predict the \(\lambda = 0\) case, the MSE was found to be 0.34.
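A sketch of that prediction: note that recent versions of glmnet require exact = TRUE (plus the original x and y) to refit at a \(\lambda\) outside the stored grid rather than extrapolate.

```r
# Predict at lambda = 0, the OLS limit of the ridge fit
ols.pred <- predict(ridge.mod, s = 0, newx = x[test, ], exact = TRUE,
                    x = x[train, ], y = y[train])
mean((ols.pred - y.test)^2)  # test MSE for the lambda = 0 case
```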

We can also build the OLS model directly.

## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.441 -0.295 -0.033  0.229 11.815 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   -1.14750    0.47833   -2.40               0.0165 *  
## bedrooms2      0.02108    0.06851    0.31               0.7583    
## bedrooms3     -0.12205    0.06861   -1.78               0.0753 .  
## bedrooms4     -0.22602    0.06990   -3.23               0.0012 ** 
## bedrooms5     -0.20698    0.07313   -2.83               0.0047 ** 
## bedrooms6     -0.37573    0.08861   -4.24   0.0000224913062269 ***
## bedrooms7     -0.42131    0.15453   -2.73               0.0064 ** 
## bedrooms8     -0.37277    0.25578   -1.46               0.1450    
## bedrooms9     -0.54117    0.35390   -1.53               0.1263    
## bedrooms10    -0.46958    0.42841   -1.10               0.2731    
## bedrooms11    -1.06635    0.60147   -1.77               0.0763 .  
## bathrooms0.75  0.26012    0.43802    0.59               0.5526    
## bathrooms1     0.25266    0.42303    0.60               0.5503    
## bathrooms1.25 -0.14773    0.54481   -0.27               0.7863    
## bathrooms1.5   0.24768    0.42382    0.58               0.5590    
## bathrooms1.75  0.26148    0.42360    0.62               0.5371    
## bathrooms2     0.26222    0.42375    0.62               0.5361    
## bathrooms2.25  0.31744    0.42396    0.75               0.4540    
## bathrooms2.5   0.26635    0.42394    0.63               0.5298    
## bathrooms2.75  0.29587    0.42450    0.70               0.4858    
## bathrooms3     0.38135    0.42486    0.90               0.3694    
## bathrooms3.25  0.52316    0.42551    1.23               0.2189    
## bathrooms3.5   0.43729    0.42540    1.03               0.3040    
## bathrooms3.75  0.80645    0.42993    1.88               0.0607 .  
## bathrooms4     0.48134    0.42968    1.12               0.2626    
## bathrooms4.25  0.90334    0.43485    2.08               0.0378 *  
## bathrooms4.5   0.93206    0.43407    2.15               0.0318 *  
## bathrooms4.75  2.71209    0.47125    5.76   0.0000000088988964 ***
## bathrooms5     1.31154    0.46329    2.83               0.0046 ** 
## bathrooms5.25  1.45563    0.50302    2.89               0.0038 ** 
## bathrooms5.5   2.53299    0.49379    5.13   0.0000002952291571 ***
## bathrooms5.75  2.36784    0.55791    4.24   0.0000221296117797 ***
## bathrooms6     2.25868    0.52767    4.28   0.0000188134408722 ***
## bathrooms6.25  4.24288    0.73613    5.76   0.0000000084529191 ***
## bathrooms6.5   0.58420    0.73839    0.79               0.4289    
## bathrooms6.75  1.30589    0.60838    2.15               0.0319 *  
## sqft_living    0.46228    0.01692   27.32 < 0.0000000000000002 ***
## sqft_lot      -0.03222    0.00578   -5.58   0.0000000248160754 ***
## floors1.5      0.04522    0.02279    1.98               0.0473 *  
## floors2        0.06952    0.01921    3.62               0.0003 ***
## floors2.5      0.17378    0.06724    2.58               0.0098 ** 
## floors3        0.41135    0.03874   10.62 < 0.0000000000000002 ***
## floors3.5      0.57692    0.42213    1.37               0.1718    
## condition2     0.19765    0.18269    1.08               0.2793    
## condition3     0.23938    0.17339    1.38               0.1674    
## condition4     0.29027    0.17346    1.67               0.0943 .  
## condition5     0.43625    0.17433    2.50               0.0123 *  
## grade5         0.01393    0.15287    0.09               0.9274    
## grade6         0.16029    0.14467    1.11               0.2679    
## grade7         0.43578    0.14477    3.01               0.0026 ** 
## grade8         0.72084    0.14563    4.95   0.0000007537853680 ***
## grade9         1.14077    0.14707    7.76   0.0000000000000095 ***
## grade10        1.63552    0.14935   10.95 < 0.0000000000000002 ***
## grade11        2.26268    0.15543   14.56 < 0.0000000000000002 ***
## grade12        3.39286    0.17457   19.44 < 0.0000000000000002 ***
## grade13        4.12985    0.30443   13.57 < 0.0000000000000002 ***
## sqft_above    -0.08974    0.01531   -5.86   0.0000000046825032 ***
## sqft_basement       NA         NA      NA                   NA    
## yr_built      -0.23916    0.00927  -25.80 < 0.0000000000000002 ***
## yr_renovated   0.04127    0.00624    6.61   0.0000000000393945 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.596 on 10739 degrees of freedom
## Multiple R-squared:  0.655,  Adjusted R-squared:  0.653 
## F-statistic:  351 on 58 and 10739 DF,  p-value: <0.0000000000000002

The test MSE for the directly fitted OLS model is 0.353, close to the 0.34 obtained from the ridge object at \(\lambda = 0\).

6.1.2 Use Cross-validation

We use glmnet's built-in cross-validation function, cv.glmnet( ), to select the best \(\lambda\) value.
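A minimal sketch of this step (cv.glmnet( ) defaults to 10-fold cross-validation):

```r
cv.out  <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)                  # CV-MSE curve with the two dotted reference lines
bestlam <- cv.out$lambda.min  # lambda minimizing the cross-validated MSE
```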

The \(\lambda\) value minimizing the cross-validated MSE turns out to be 0.071 in this case.

## [1] 0.342
##   (Intercept)     bedrooms2     bedrooms3     bedrooms4     bedrooms5 
##       0.04348       0.12109       0.00438      -0.07223      -0.06973 
##     bedrooms6     bedrooms7     bedrooms8     bedrooms9    bedrooms10 
##      -0.22814      -0.51888       0.01736      -0.45761      -0.43431 
##    bedrooms11 bathrooms0.75    bathrooms1 bathrooms1.25  bathrooms1.5 
##      -0.80846      -0.00521      -0.06374       0.23833      -0.05507 
## bathrooms1.75    bathrooms2 bathrooms2.25  bathrooms2.5 bathrooms2.75 
##      -0.04347      -0.04695       0.00590      -0.04798       0.00637 
##    bathrooms3 bathrooms3.25  bathrooms3.5 bathrooms3.75    bathrooms4 
##       0.06353       0.27259       0.13974       0.48025       0.41569 
## bathrooms4.25  bathrooms4.5 bathrooms4.75    bathrooms5 bathrooms5.25 
##       0.69822       0.52011       1.42889       0.93785       1.24494 
##  bathrooms5.5 bathrooms5.75    bathrooms6 bathrooms6.25  bathrooms6.5 
##       1.63620       1.17970       2.68379       1.48659      -0.27442 
## bathrooms6.75  bathrooms7.5 bathrooms7.75    bathrooms8   sqft_living 
##       1.24617      -0.11642       9.60917       3.84458       0.19455 
##      sqft_lot     floors1.5       floors2     floors2.5       floors3 
##      -0.02574       0.06864       0.06276       0.35401       0.32881 
##     floors3.5    condition2    condition3    condition4    condition5 
##       0.50963      -0.11889      -0.05088       0.01750       0.14366 
##        grade4        grade5        grade6        grade7        grade8 
##      -0.56617      -0.58286      -0.49016      -0.27727      -0.01932 
##        grade9       grade10       grade11       grade12       grade13 
##       0.36524       0.81265       1.42433       2.54973       4.30675 
##    sqft_above sqft_basement      yr_built  yr_renovated 
##       0.14404       0.11795      -0.20027       0.04575

In the cross-validation plot, the first vertical dotted line marks the \(\lambda\) achieving the lowest MSE, 0.342; the second marks the largest \(\lambda\) within one standard error of that minimum. We then calculate the R squared value, which is 0.65 for the ridge model.

6.2 The Lasso

The same function, glmnet( ), with alpha set to 1 builds the lasso regression model. We again plot the coefficients against different \(\lambda\) values to see the overall trend.
Here, the lowest MSE occurs when \(\lambda\) equals 0.369, at which point about 47 coefficients are non-zero.
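The lasso steps can be sketched the same way, reusing the grid and train/test split from the ridge section:

```r
# alpha = 1 selects the lasso
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)  # many coefficients shrink exactly to zero

cv.lasso   <- cv.glmnet(x[train, ], y[train], alpha = 1)
bestlam    <- cv.lasso$lambda.min
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)  # test MSE

# The non-zero coefficients are the variables the lasso keeps
lasso.coef <- predict(lasso.mod, s = bestlam, type = "coefficients")
lasso.coef[lasso.coef != 0]
```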

## [1] 0.344
##   (Intercept)     bedrooms2     bedrooms3     bedrooms4     bedrooms5 
##       0.01286       0.09360       0.00000      -0.04818       0.00000 
##     bedrooms6     bedrooms7     bedrooms8     bedrooms9    bedrooms10 
##      -0.07138      -0.11450       0.00000       0.00000       0.00000 
##    bedrooms11 bathrooms0.75    bathrooms1 bathrooms1.25  bathrooms1.5 
##       0.00000       0.00000       0.00000       0.00000       0.00000 
## bathrooms1.75    bathrooms2 bathrooms2.25  bathrooms2.5 bathrooms2.75 
##       0.00000       0.00000       0.00000      -0.02710       0.00000 
##    bathrooms3 bathrooms3.25  bathrooms3.5 bathrooms3.75    bathrooms4 
##       0.00000       0.18498       0.03274       0.29635       0.18297 
## bathrooms4.25  bathrooms4.5 bathrooms4.75    bathrooms5 bathrooms5.25 
##       0.42824       0.23767       0.99029       0.46857       0.67974 
##  bathrooms5.5 bathrooms5.75    bathrooms6 bathrooms6.25  bathrooms6.5 
##       0.91052       0.06104       1.95610       0.06491       0.00000 
## bathrooms6.75  bathrooms7.5 bathrooms7.75    bathrooms8   sqft_living 
##       0.00000       0.00000       8.02174       2.24596       0.39307 
##      sqft_lot     floors1.5       floors2     floors2.5       floors3 
##      -0.01899       0.00288       0.02334       0.22601       0.31298 
##     floors3.5    condition2    condition3    condition4    condition5 
##       0.01047      -0.03239      -0.03053       0.00000       0.11863 
##        grade4        grade5        grade6        grade7        grade8 
##      -0.28065      -0.51219      -0.47415      -0.26078       0.00000 
##        grade9       grade10       grade11       grade12       grade13 
##       0.37851       0.85741       1.52279       2.72151       4.68689 
##    sqft_above sqft_basement      yr_built  yr_renovated 
##       0.00000       0.02922      -0.20745       0.03924
##   (Intercept)     bedrooms2     bedrooms4     bedrooms6     bedrooms7 
##       0.01286       0.09360      -0.04818      -0.07138      -0.11450 
##  bathrooms2.5 bathrooms3.25  bathrooms3.5 bathrooms3.75    bathrooms4 
##      -0.02710       0.18498       0.03274       0.29635       0.18297 
## bathrooms4.25  bathrooms4.5 bathrooms4.75    bathrooms5 bathrooms5.25 
##       0.42824       0.23767       0.99029       0.46857       0.67974 
##  bathrooms5.5 bathrooms5.75    bathrooms6 bathrooms6.25 bathrooms7.75 
##       0.91052       0.06104       1.95610       0.06491       8.02174 
##    bathrooms8   sqft_living      sqft_lot     floors1.5       floors2 
##       2.24596       0.39307      -0.01899       0.00288       0.02334 
##     floors2.5       floors3     floors3.5    condition2    condition3 
##       0.22601       0.31298       0.01047      -0.03239      -0.03053 
##    condition5        grade4        grade5        grade6        grade7 
##       0.11863      -0.28065      -0.51219      -0.47415      -0.26078 
##        grade9       grade10       grade11       grade12       grade13 
##       0.37851       0.85741       1.52279       2.72151       4.68689 
## sqft_basement      yr_built  yr_renovated 
##       0.02922      -0.20745       0.03924

We then calculate the R squared of lasso regression, which is 0.648.
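The R squared values quoted in this section come from comparing test-set predictions to the observed prices; a minimal sketch, assuming the lasso.pred and y.test objects from above:

```r
# R^2 = 1 - RSS/TSS on the test set
rss <- sum((lasso.pred - y.test)^2)
tss <- sum((y.test - mean(y.test))^2)
1 - rss / tss
```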

Lasso regression is also a useful tool for feature selection, so we build an ordinary linear model using the variables the lasso retained.

## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + condition + grade + sqft_basement + yr_built + yr_renovated, 
##     data = kc_house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2297337  -107678   -12103    84167  4423367 
## 
## Coefficients:
##                   Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)   5786132.6014  287169.2537   20.15 < 0.0000000000000002 ***
## bedrooms2        5000.7996   16364.1250    0.31              0.75992    
## bedrooms3      -42266.0406   16372.6413   -2.58              0.00984 ** 
## bedrooms4      -77996.4955   16713.0612   -4.67  0.00000307769658706 ***
## bedrooms5      -79967.7532   17595.1577   -4.54  0.00000552672670617 ***
## bedrooms6     -145434.5049   21451.2933   -6.78  0.00000000001234956 ***
## bedrooms7     -261655.2281   39396.0402   -6.64  0.00000000003175618 ***
## bedrooms8      -55905.0419   62457.8929   -0.90              0.37075    
## bedrooms9     -257155.1102   97573.2300   -2.64              0.00841 ** 
## bedrooms10    -228865.1782  125830.2877   -1.82              0.06895 .  
## bedrooms11    -376317.8418  213834.1827   -1.76              0.07845 .  
## bathrooms0.75  120775.3931  109765.0661    1.10              0.27121    
## bathrooms1      96789.8822  106610.1309    0.91              0.36395    
## bathrooms1.25  198630.3519  128075.2853    1.55              0.12094    
## bathrooms1.5   100835.8044  106757.7965    0.94              0.34491    
## bathrooms1.75  106680.6104  106699.1434    1.00              0.31741    
## bathrooms2     106596.2395  106735.7985    1.00              0.31795    
## bathrooms2.25  126046.4763  106764.4372    1.18              0.23777    
## bathrooms2.5   107924.7623  106735.6188    1.01              0.31196    
## bathrooms2.75  127022.2004  106878.5555    1.19              0.23466    
## bathrooms3     148479.3603  106994.8134    1.39              0.16524    
## bathrooms3.25  220325.3718  107135.6109    2.06              0.03975 *  
## bathrooms3.5   171417.8775  107095.7088    1.60              0.10948    
## bathrooms3.75  300179.5560  108209.0164    2.77              0.00554 ** 
## bathrooms4     270502.4083  108467.9347    2.49              0.01264 *  
## bathrooms4.25  376695.2129  109667.6338    3.43              0.00059 ***
## bathrooms4.5   316003.6531  109134.8687    2.90              0.00379 ** 
## bathrooms4.75  656581.3498  116105.9686    5.66  0.00000001578048641 ***
## bathrooms5     474915.5948  116867.4050    4.06  0.00004846948630483 ***
## bathrooms5.25  593633.4268  122811.4895    4.83  0.00000134944579199 ***
## bathrooms5.5   727072.1630  127683.4780    5.69  0.00000001254684228 ***
## bathrooms5.75  545572.0030  153024.0622    3.57              0.00036 ***
## bathrooms6    1134991.7320  139869.6650    8.11  0.00000000000000051 ***
## bathrooms6.25  650879.2683  188349.5092    3.46              0.00055 ***
## bathrooms6.5   -17186.0546  185503.2678   -0.09              0.92619    
## bathrooms6.75  567853.4806  186810.9333    3.04              0.00237 ** 
## bathrooms7.5   145092.2356  256965.4910    0.56              0.57233    
## bathrooms7.75 3828379.1387  248399.0807   15.41 < 0.0000000000000002 ***
## bathrooms8    1570473.5799  191547.0917    8.20  0.00000000000000026 ***
## sqft_living       138.0847       3.8255   36.10 < 0.0000000000000002 ***
## sqft_lot           -0.2596       0.0362   -7.16  0.00000000000081743 ***
## floors1.5       19188.7704    5780.9824    3.32              0.00090 ***
## floors2         27629.1245    4848.8142    5.70  0.00000001227173559 ***
## floors2.5      123157.5631   17500.7659    7.04  0.00000000000201930 ***
## floors3        136363.4875    9853.8560   13.84 < 0.0000000000000002 ***
## floors3.5      194254.8526   81197.8697    2.39              0.01675 *  
## condition2     -19588.7798   42913.0952   -0.46              0.64805    
## condition3       5853.5782   39923.4713    0.15              0.88343    
## condition4      29077.6308   39932.7519    0.73              0.46652    
## condition5      74263.2141   40168.0351    1.85              0.06450 .  
## grade4          47977.1099  217441.7739    0.22              0.82537    
## grade5          58346.9150  215109.2325    0.27              0.78621    
## grade6         103493.8750  214945.4149    0.48              0.63017    
## grade7         201590.6959  214977.4686    0.94              0.34839    
## grade8         308137.7426  215017.3003    1.43              0.15185    
## grade9         465726.1302  215080.0727    2.17              0.03037 *  
## grade10        645649.1516  215182.1027    3.00              0.00270 ** 
## grade11        892622.0158  215462.4534    4.14  0.00003443713090282 ***
## grade12       1337778.9614  216619.3422    6.18  0.00000000067044514 ***
## grade13       2017629.0156  225570.4451    8.94 < 0.0000000000000002 ***
## sqft_basement      37.9302       4.6529    8.15  0.00000000000000038 ***
## yr_built        -3014.3776      79.9195  -37.72 < 0.0000000000000002 ***
## yr_renovated       36.2602       3.8886    9.32 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 213000 on 21533 degrees of freedom
## Multiple R-squared:  0.665,  Adjusted R-squared:  0.664 
## F-statistic:  689 on 62 and 21533 DF,  p-value: <0.0000000000000002
##     bedrooms2     bedrooms3     bedrooms4     bedrooms5     bedrooms6 
##         14.22         31.67         28.90         10.13          2.73 
##     bedrooms7     bedrooms8     bedrooms9    bedrooms10    bedrooms11 
##          1.30          1.12          1.26          1.05          1.01 
## bathrooms0.75    bathrooms1 bathrooms1.25  bathrooms1.5 bathrooms1.75 
##         18.81        793.56          3.26        339.08        657.43 
##    bathrooms2 bathrooms2.25  bathrooms2.5 bathrooms2.75    bathrooms3 
##        441.80        466.05       1015.12        282.29        183.58 
## bathrooms3.25  bathrooms3.5 bathrooms3.75    bathrooms4 bathrooms4.25 
##        145.10        178.74         39.76         35.08         20.89 
##  bathrooms4.5 bathrooms4.75    bathrooms5 bathrooms5.25  bathrooms5.5 
##         26.16          6.83          6.32          4.32          3.60 
## bathrooms5.75    bathrooms6 bathrooms6.25  bathrooms6.5 bathrooms6.75 
##          2.07          2.59          1.57          1.52          1.54 
##  bathrooms7.5 bathrooms7.75    bathrooms8   sqft_living      sqft_lot 
##          1.46          1.36          1.62          5.88          1.07 
##     floors1.5       floors2     floors2.5       floors3     floors3.5 
##          1.28          2.64          1.08          1.27          1.02 
##    condition2    condition3    condition4    condition5        grade4 
##          6.85        172.97        147.24         55.76         28.13 
##        grade5        grade6        grade7        grade8        grade9 
##        244.31       1881.57       5348.36       4449.50       2345.99 
##       grade10       grade11       grade12       grade13 sqft_basement 
##       1097.76        401.17         91.77         14.59          2.02 
##      yr_built  yr_renovated 
##          2.63          1.16

The p-values for the condition dummies are all above 0.05, so we remove condition from the model and rebuild it.

## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + grade + sqft_basement + yr_built + yr_renovated, 
##     data = kc_house_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2299861  -107724   -11657    84378  4411029 
## 
## Coefficients:
##                   Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)   6390398.7826  281343.4047   22.71 < 0.0000000000000002 ***
## bedrooms2        8793.1555   16411.3820    0.54              0.59210    
## bedrooms3      -36077.6335   16412.8287   -2.20              0.02795 *  
## bedrooms4      -71532.7566   16754.2911   -4.27  0.00001967336217461 ***
## bedrooms5      -74411.1623   17640.4167   -4.22  0.00002472445036141 ***
## bedrooms6     -142953.7615   21515.0023   -6.64  0.00000000003117787 ***
## bedrooms7     -258423.2081   39524.7387   -6.54  0.00000000006363165 ***
## bedrooms8      -55473.9576   62666.8420   -0.89              0.37605    
## bedrooms9     -278523.9215   97883.1397   -2.85              0.00444 ** 
## bedrooms10    -227727.6995  126239.8568   -1.80              0.07126 .  
## bedrooms11    -383449.7239  214555.7074   -1.79              0.07392 .  
## bathrooms0.75  130401.2659  110127.9946    1.18              0.23639    
## bathrooms1     101396.6106  106966.1914    0.95              0.34318    
## bathrooms1.25  214387.1242  128495.1322    1.67              0.09524 .  
## bathrooms1.5   109925.9055  107111.6944    1.03              0.30477    
## bathrooms1.75  118404.4415  107051.3532    1.11              0.26872    
## bathrooms2     119919.4240  107086.1173    1.12              0.26279    
## bathrooms2.25  139025.1153  107115.6382    1.30              0.19434    
## bathrooms2.5   120305.1066  107087.1231    1.12              0.26127    
## bathrooms2.75  142766.2980  107227.2671    1.33              0.18306    
## bathrooms3     161875.0630  107346.0986    1.51              0.13158    
## bathrooms3.25  234468.4101  107486.7802    2.18              0.02917 *  
## bathrooms3.5   183845.2189  107447.5300    1.71              0.08709 .  
## bathrooms3.75  313888.2729  108564.3299    2.89              0.00384 ** 
## bathrooms4     285206.2049  108822.5754    2.62              0.00878 ** 
## bathrooms4.25  391992.6328  110026.5062    3.56              0.00037 ***
## bathrooms4.5   330613.4208  109491.8211    3.02              0.00253 ** 
## bathrooms4.75  670750.4580  116488.1711    5.76  0.00000000862231213 ***
## bathrooms5     486169.3041  117254.1868    4.15  0.00003392061401319 ***
## bathrooms5.25  608173.3283  123212.7876    4.94  0.00000080353413071 ***
## bathrooms5.5   740445.7587  128106.9263    5.78  0.00000000757747603 ***
## bathrooms5.75  555121.0099  153534.8034    3.62              0.00030 ***
## bathrooms6    1146755.7758  140335.5532    8.17  0.00000000000000032 ***
## bathrooms6.25  663880.3834  188978.6532    3.51              0.00044 ***
## bathrooms6.5    -4874.1796  186123.2253   -0.03              0.97911    
## bathrooms6.75  567544.9154  187437.6867    3.03              0.00247 ** 
## bathrooms7.5   179930.9422  257815.0502    0.70              0.48524    
## bathrooms7.75 3844440.4336  249230.9862   15.43 < 0.0000000000000002 ***
## bathrooms8    1580314.5517  192191.3916    8.22 < 0.0000000000000002 ***
## sqft_living       138.6162       3.8359   36.14 < 0.0000000000000002 ***
## sqft_lot           -0.2583       0.0363   -7.11  0.00000000000119662 ***
## floors1.5       17634.6404    5788.3156    3.05              0.00232 ** 
## floors2         23824.3958    4833.3356    4.93  0.00000083187705496 ***
## floors2.5      120957.7694   17553.5690    6.89  0.00000000000570189 ***
## floors3        133102.5298    9863.2196   13.49 < 0.0000000000000002 ***
## floors3.5      195218.9047   81467.8489    2.40              0.01657 *  
## grade4         -10520.8101  218092.4299   -0.05              0.96153    
## grade5          10015.4782  215780.0375    0.05              0.96298    
## grade6          55094.9696  215622.1274    0.26              0.79833    
## grade7         152646.3785  215652.8471    0.71              0.47906    
## grade8         258817.8723  215692.0607    1.20              0.23018    
## grade9         416117.1073  215755.0687    1.93              0.05379 .  
## grade10        594836.1060  215856.6898    2.76              0.00586 ** 
## grade11        840338.3056  216135.2163    3.89              0.00010 ***
## grade12       1285016.8472  217296.0725    5.91  0.00000000339622408 ***
## grade13       1959427.7048  226272.2209    8.66 < 0.0000000000000002 ***
## sqft_basement      40.3123       4.6630    8.65 < 0.0000000000000002 ***
## yr_built        -3295.5874      76.1924  -43.25 < 0.0000000000000002 ***
## yr_renovated       28.4576       3.8364    7.42  0.00000000000012345 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214000 on 21537 degrees of freedom
## Multiple R-squared:  0.662,  Adjusted R-squared:  0.662 
## F-statistic:  729 on 58 and 21537 DF,  p-value: <0.0000000000000002
##     bedrooms2     bedrooms3     bedrooms4     bedrooms5     bedrooms6 
##         14.22         31.67         28.90         10.13          2.73 
##     bedrooms7     bedrooms8     bedrooms9    bedrooms10    bedrooms11 
##          1.30          1.12          1.26          1.05          1.01 
## bathrooms0.75    bathrooms1 bathrooms1.25  bathrooms1.5 bathrooms1.75 
##         18.81        793.56          3.26        339.08        657.43 
##    bathrooms2 bathrooms2.25  bathrooms2.5 bathrooms2.75    bathrooms3 
##        441.80        466.05       1015.12        282.29        183.58 
## bathrooms3.25  bathrooms3.5 bathrooms3.75    bathrooms4 bathrooms4.25 
##        145.10        178.74         39.76         35.08         20.89 
##  bathrooms4.5 bathrooms4.75    bathrooms5 bathrooms5.25  bathrooms5.5 
##         26.16          6.83          6.32          4.32          3.60 
## bathrooms5.75    bathrooms6 bathrooms6.25  bathrooms6.5 bathrooms6.75 
##          2.07          2.59          1.57          1.52          1.54 
##  bathrooms7.5 bathrooms7.75    bathrooms8   sqft_living      sqft_lot 
##          1.46          1.36          1.62          5.88          1.07 
##     floors1.5       floors2     floors2.5       floors3     floors3.5 
##          1.28          2.64          1.08          1.27          1.02 
##    condition2    condition3    condition4    condition5        grade4 
##          6.85        172.97        147.24         55.76         28.13 
##        grade5        grade6        grade7        grade8        grade9 
##        244.31       1881.57       5348.36       4449.50       2345.99 
##       grade10       grade11       grade12       grade13 sqft_basement 
##       1097.76        401.17         91.77         14.59          2.02 
##      yr_built  yr_renovated 
##          2.63          1.16

The R squared value is 0.662, slightly better than what ridge (0.65) and lasso (0.648) regression achieved.

7 Conclusion

In the end, each of our models comes with its own advantages and disadvantages. Although KNN proved 74% accurate at classifying prices into "low", "medium", and "high" categories, those categories ultimately tell us little, given their large ranges. Decision trees provide simple visualizations and tell us which features are the most important, but they oversimplify the dataset and explain relatively little variance (even with the random forest included). PCA and PCR likewise yield relatively low amounts of variance explained (around 60%), and they do not differ much from the full linear model. The ridge and lasso regressions also do not perform as well as the full linear model (they have lower R² values). To summarize our findings: linear regression tends to offer the most explanatory power, and "sqft_living" and "grade" seem to influence price the most. This makes intuitive sense: living space and quality of construction are the most important drivers of housing price, and their largely linear relationship with price makes a simple linear model both efficient and strong.

8 Bibliography

Dataset available: https://www.kaggle.com/harlfoxem/housesalesprediction/data

Perry, M. J. (2016, June 5). New US homes today are 1,000 square feet larger than in 1973 and living space per person has nearly doubled. Retrieved from https://www.aei.org/carpe-diem/new-us-homes-today-are-1000-square-feet-larger-than-in-1973-and-living-space-per-person-has-nearly-doubled/

Roberts, D. (n.d.). Variance and Standard Deviation. Retrieved from https://mathbitsnotebook.com/Algebra1/StatisticsData/STSD.html

Rosenberg, M. (2018, July 31). Seattle-area home prices this spring rose at fastest rate since 2006 bubble. The Seattle Times. Retrieved from https://www.seattletimes.com/business/real-estate/seattle-area-home-prices-this-spring-rose-at-fastest-rate-since-2006-bubble/

Anti-Code Group

11/26/2019